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Abstract. We describe a new algorithm that computes the nth Bernoulli 
number in 0(n 1 ' 333 "'+°W) bit operations. This improves on previous algo- 
rithms that had complexity 0(n 2+0 < 1 )). 



in 



1. Introduction 
The Bernoulli numbers Bq, B\, B 2 , ■ ■ ■ are rational numbers denned by 

2-^ fci 1 ■ 



e* - 1 ^ fc! 

fe=0 

Every odd- index Bernoulli number is zero except for B\ = —1/2. The von Staudt- 
Clausen theorem states that the denominator of B 2n is precisely the product of the 
primes p such that p — 1 2n, and Eulcr's formula 

(i) ^« = (-ir +1 |^c(2n), 

where £(s) = 53fc=i ^ _;5 i s the Riemann zeta function, implies that the number of 
bits in the numerator of B 2n is 0(n.logn). 

It is known that the first n Bernoulli numbers may be computed simultaneously 
in 0(n 2 \og 2+ °^ n) bit operations |BH11) . This bound is optimal up to logarithmic 
factors, as the total number of bits being computed is 0(n 2 logn). (In this paper, 
"bit operations" always means number of operations in the multitape Turing model, 
as in |Pap94| .) 

The situation concerning computation of a single B n is less satisfactory. As B n 
has 0{n 1+0 ^) bits, it is conceivable that it can be computed in only O(n 1+0 ^ 1 ^) 
bit operations. However, the best published complexity bounds have the shape 
O(n 2+o( ^ 1 ^), which is essentially no better than computing all of -Bo, ■ • • , B n . This 
is achieved by two quite different algorithms: the "zeta function algorithm" , which 
approximates C(2n) in the right hand side of ([T]) via the Euler product (this has been 
rediscovered numerous times — see the discussion in (HarlOj ). and the multimodular 
algorithm introduced by the author in [HarlOj . 

In this paper we close two thirds of this gap. Our main result is: 

Theorem 1. Let 4 < a < ^. The Bernoulli number B n may be computed in 

U(n log v ' n) 

bit operations, using 

0(n 2 - 2a \og 1+2a n) 

bits of space. 
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The bounds are uniform in a, i.e. the implied O(-) constants do not depend on 
a, and the o(l) term approaches zero asn-> oo, independently of a. 
In particular, taking a = 1/3, we obtain the time bound 



space may be achieved; see Remark 01 

Our strategy may be summarised as follows. It is technically convenient to work 
with the 'integralised' Bernoulli numbers 



It follows immediately from the von Staudt-Clausen theorem and Fermat's little 
theorem that B* n G Z. Moreover log_B* = O(nlogn), so asymptotically the number 
of bits we must determine is the same as for B n . 

In [HarlOj . we gave a formula, sometimes known as a Voronoi congruence, that 
expresses B n (mod p) as a sum of 0(p) terms, for a prime p. Evaluating this for- 
mula for sufficiently many p, and combining the results using the Chinese remainder 
theorem, led to the overall complexity bound 0(n 2+0 W) for computing B n . 

Proposition[2]below may be interpreted as a p-adic generalisation of this formula, 
expressing B* (mod p s ) as a sum of 0{ps) terms. Evaluating this formula in the 
straightforward way has complexity 0(ps 2 ) (ignoring logarithmic factors), because 
we are doing arithmetic in Z/p s Z, whose elements have O(slogp) bits. This ap- 
proach has more flexibility than that of [H arlOj . as we may choose s as a function 
of p to optimise the total cost. Unfortunately, it turns out that this still leads to a 
quasi-quadratic complexity bound for computing B n . 

However, we make two key observations. First, provided p is not too small 
compared to s, we can use techniques of fast polynomial arithmetic to save a factor 
of roughly s in the evaluation of the formula of Proposition [2] This trick is already 
enough to lower the overall cost to O(n 3 ^ 2+0 ^) (see Remark |4}. Second, by some 
algebraic rearrangement and careful choice of coefficient rings, we may evaluate 
the formula of Proposition [2] for many primes simultaneously. This reduces the 
complexity further to O(n 4 ^ 3+0 ^). 

We do not know for what n an efficient implementation of these new algorithms 
would be faster than existing implementations of the quasi-quadratic algorithms. 
This is an interesting question for further study. 




B* = 2(1 - 2 n )B,, 



2. Congruences 



Proposition 2. Let n > s > 1 and let p be an odd prime. Let 




Then 



p-i 



^2(-iyj n - s F p (j) (mod/). 
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Proof. Our proof is modelled on [Coh07, Prop. 9.1.3]. 
The exponential generating function for the B* is 

ri 



E B n t n _ n _ 2B n _ 2t It _ 2t 

n\ n\ ^—^ nl e* — 1 e 2 * — 1 e t + 



n>0 n>0 n>0 

Define a sequence of 'integralised' Bernoulli polynomials by 

=it(t) B > n ~ k ez w> 

whose exponential generating function is given by 

n>0 ' \k>0 ' / \m>0 ' ) 

Now on one hand we have 

p-i n p-i 

^(-i)%(pt,i/ P ) = £ *L ^(-i)^;(j/ P )^ 

j=0 n>0 ' 3=0 

while on the other hand this sum is also equal to 



2te t 



p-i 



£(-D- 



2pte 



i j 



2pt 



p-i 



e pi _l_ 1 e pt + 



j=0 ' 3=0 ri>0 

Equating coefficients of £" and using Bq = we obtain 

p— 1 p— 1 n— 1 

j=0 j=0 k=0 

Truncating this sum modulo p s yields the desired congruence 



n 

k + 1 



B l+i3 



n—k—lpk 



□ 



3. Algorithms 



Let n > s > 1, and define 



s-l 



fe=0 



n 

k + 1 



a k+l x 



e z[x]. 



Note that F(x) depends on n and s, but (crucially) not on p. For any prime p > 3 
and any < j < p we have F p (j) — p s ~ 1 F(j /p). We obtain the following bounds 
for the coefficients of F(x) and for F p (j). 



Lemma 3. Let n > 1 and < k < n. Then 



n 

k + 1 



B 



k+1 



k+1 



Proof. For k = the assertion is that n < 7(n/ir). For even k > 2, we have 
£?k +1 = 0. For odd 1 < k < n, by ^ we have 



n 
k + 1 



B 



k+1 



{k + l)!(n - fc - 1)! 

n 1 1 

7 r 77 — f-pr- < 6.579...(n/7r) 

(n - k - 1)! tt^ 1 ~ v ' ' 



< 4C(2) 



} w^ c(fc+1) 

fc+1 . □ 
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Lemma 4. Let n > s > 4. Let p be an odd prime and let < j < p. Then 

\F p (j)\ < 3(np/7Ty+\ 
Proof. By the previous lemma we have 

\F p (j)\ ^Un/nrVf- 1 -" < 7p^J2Nn) k ^ = 7p s ~\nM {n, f ~ 1 
* — ' £ — ' n/n — 1 

k=0 k=0 ' 

- 3 2 (4/ ^_ 1) ("pA) S+1 = 2.846...(npA) s+1 . □ 

We recall some standard results concerning the complexity of integer and poly- 
nomial arithmetic; all of this may be found in |vzGG03] . 

Let R = Z/2 M Z where M > 1. Addition and subtraction in R require O(M) 
bit operations. Multiplication in R costs 0(M Iog 1+O ' 1 ' ) M) bit operations, using 
O(M) bits of space, via FFT methods. Division in R (where possible) has the same 
asymptotic time and space complexity as multiplication, using Newton's method. 
If G G R[x] is a polynomial of degree s, and x%, . . . ,xt £ R, with t < s, then we 
may simultaneously evaluate G(xi), . . . , G{xt) G R using a fast multipoint eval- 
uation algorithm in 0(sMlog 1+ °^ 1 \sM) logs) bit operations. The simplest such 
algorithms have space complexity O(sMlogs), but this can be reduced to O(sM) 
by the method of |vzGS92[ Lemma 2.1]. 

Now let p be an odd prime, s > 1, and R = Z/p s Z. We assume here that 
p < n and s < n. The results are similar: addition and subtraction in R require 
0(s \ogp) — 0(s log n) bit operations, and multiplication in R costs 0(s log 2+o( - 1 ' n) 
bit operations. 

Finally we mention that the primes p < N may be enumerated by a straightfor- 
ward sieve method in 0(iV 1+0 W) bit operations. 

Proposition 5. Let n > s > 4 and let N < n. Let P be a set of primes with 
3 < p < N for all p G P. Assume that ^2 peP p < s. Then the residues B* 
(mod p s ) may be computed for all p G P simultaneously in 

0(s 2 log 3+o(1) n) 

bit operations, using 

0(s 2 log n) 

bits of space. 

Proof. Let M = [log 2 (3(nA/7r) s+1 )] + 1. For this choice of M, by Lemma H we 
have \F p (j)\ < 2 M /2 for all p G P, < j < p, so to compute F p (j) it suffices to 
determine it modulo 2 M . Note that M — 0(s\ogn). 

We perform the following steps, each of which uses 0(s 2 log n) space. 

Step 1. Compute BJ: for 1 < k < s using (for example) the algorithm of [BH11 . 
This costs 0(s 2 log 2+o(1) s) = 0(s 2 log 2+o(1) n) bit operations. 

Step 2. Compute (?) for 1 < k < s. Using a straightforward algorithm this can 
be done in 0{s 2 log 2 n) bit operations. 

Step 3. Compute the coefficients of F{x), by computing the products (^)-B^ for 
< k < s. Each product needs 0(s log 2+o( - 1 ^ n) bit operations. The total cost is 
0{s 2 log 2+o( ' 1 ^ n) bit operations. 
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Step 4- Compute j/p (mod 2 M ) for each p € P, < j < p. Each division costs 
0(M log 1 " 1-0 ^ 1 - 1 M ) — 0(s log 2+ °^ n) bit operations. Since we have assumed that 
YlpepP — s > the total cost is 0(s 2 log 2+ °^ n) bit operations. 

Step 5. Regarding F(x) as a polynomial in (Z/2 M Z)[x], evaluate simultane- 
ously F(j/p) (mod 2 M ) for all p e P, < j < p. This costs 0(s 2 log 3+o(1) n) bit 
operations. 

Step 6. For each p E P, < j < p, recover F p (j) = p s ~ 1 F(j /p) (mod 2 M ), and 
hence the exact integer F p (j). Since p s_1 < 2 M , we may compute p s_1 , and then 
F p (j), in (9(Mlog 1+o(1) M) = 0(s log 2+o(1) n) bit operations, and thus the total 
cost is 0(s 2 log 2+ ° < - 1 ' n) bit operations. 

Step 7. For each p G P, < j < p, compute j n ~ s (modp s ). Each power 
costs 0((logn)(slog 2+ °^ 1 ^ nj) — 0(s log 3+ °^ n) bit operations, so the total cost is 
0(s 2 \og 3+ °^ n) bit operations. 

Step 8. Use Proposition [5] to recover B* n (mod p s ) for each p 6 P. The cost is 
(9(s 2 log 2+o(1) n) bit operations. □ 

Remark 1. The complexity of step 7 can be improved, by computing first q n ~ s 
(mod p s ) for primes q < p, and then using (ji.72)™ 5 — j\~ 8 ^ or composite 
j = Jij2- This saves a factor of logn in this step, provided that p is not too small, 
say p > n c for any fixed c > 0. This will be the case for almost all primes p used in 
the proof of Theorem [TJ 

Remark 2. In a practical setting, one may wish to replace the ring Z/2 M Z by Z/TZ 
where T is a suitably large integer not divisible by any p <E P. For example, one 
could take T to be a product of many word-sized primes q for which there exist 
efficient number-theoretic transforms modulo q. Under this scheme, the expensive 
evaluation in Step 5 could be performed for each q separately, and then the F p (j) 
could be reconstructed in Step 6 using the Chinese remainder theorem. This ap- 
proach does not change the asymptotic complexity, but potentially yields a drastic 
improvement in memory locality. 

Remark 3. Further practical savings may be realised by using the easily-proved fact 
that B^(l — x) = —B^(x) for even n, so that 

(p-i)/2 

B; i= 2 J2 (-l) j i n - s F P (3) (modp s ). 
3=1 

Coupled with the observation that essentially half of the coefficients of F(x) are 
zero, this leads to a savings of a factor of two in the main evaluation step. 

Now we may prove the main result. 

Proof of Theorem^ Recall that 1/3 < a < 1/2. We will take 

N = [n a log 1 "" nj , s = [2n 1 ~ Q log" n\ . 

We may assume that n is large enough so that n > s > N > 4. In particular we 
may assume that the hypotheses of Proposition [5] are satisfied. 

Let P be the set of odd primes p < N, so that \P\ = 0(N/ log N) = 0{n a log~"n). 
Let r — I2n 1 ~ 2a log 2 " -1 nj . Note that r > 1 for sufficiently large n. Parti- 
tion P into d sets Pi,...,Pd of cardinality at most r, where d = 0{\P\/r) = 
Oin 301 ' 1 log 1 " 3 " n). For each i we have £ P eP s P < \Pi\N <rN <s. 
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Apply Proposition [5] to each set Pi separately. The space usage for each invo- 
cation is 0(s 2 logrt) = 0(n 2 ~ 2a log 2a+1 n). This space may be reused for each Pi. 
The total time cost is 0{ds 2 log 3+o(1) n) = 0(n 1+a log 4 ~ Q+o(1) n). 

At this stage we have computed -B* (mod p s ) for all p G P. This is enough to 
determine B* (for sufficiently large n), because 

log p s = s logp ~ sN — 2n log n + 0(n), 

p£P 3<p<N 

whereas log-B* = nlogn + 0{n). Using fast Chinese remaindering we may then 
recover £*, and hence B n , in 0(n 1+o ^) bit operations. □ 

Remark 4. We sketch an algorithm that improves the time and space complexities 
to respectively 0(n 3 ^ 2 Iog 3+O ^ 1 ' ) n) and O(nlogn) in the case a = 1/2. Consider the 
algorithm of Proposition [5] applied to a set P = {p} consisting of a single prime. 
The evaluation points j/p, for < j < f>, now form an arithmetic progression. 
We relax the condition p < s, instead allowing p as large as slogs. Instead of 
evaluating at all p points simultaneously, we first evaluate at only s points, and 
then use the value-shifting algorithm of [ShoSU Theorem 3.1] (alternatively the 
algorithm of IBGS07I Theorem 5]) to evaluate at the remaining p — s points, in 
blocks of s points at a time. Then in the proof of Theorem [TJ we take s = |_2n 1 / 2 J 
and N = [n 1 / 2 \ogn\, and only use Proposition [5] for one prime at a time. This 
leads to the complexity bounds stated above; we omit the proof, which is similar 
to that of Theorem [TJ 
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